1 Statement of the problems studied

This project has investigated novel design technologies for energy-efficient VLSI systems.
Its primary focus has been on charge-recovery circuits. These circuits achieve higher
energy efficiency than their conventional counterparts by steering currents to flow across
devices with low voltage drops, while recycling undissipated energy in parasitic capacitors.
Previous investigations into charge recovery have resulted in complex circuits and
architectures that are impractical for high-speed design. This project has led to the discovery
of practical low-complexity charge-recovery circuits that combine high energy efficiency
with clock frequencies in excess of 1GHz. The results of this research have been
validated through silicon prototyping and experimentation. For four of the inventions
resulting from this project, the University of Michigan has filed utility and provisional patent
applications with the US Patent and Trademark Office.


2  Summary  of  the  most  important  findings 

The  main  contributions  of  this  research  project  were  the  following. 

•  Boost  Logic  for  GHz  charge-recovery  operation:  Boost  logic  is  a  novel  dynamic 
charge-recovery  family  that  operates  with  a  two-phase  power-clock  waveform.  In 
post-layout  Spice  simulations  of  16-bit  multipliers  in  a  130nm  bulk  silicon  process 
at  1GHz,  Boost  Logic  implementations  achieve  5-10  times  higher  energy  efficiency 
than  minimum-energy  pipelined  and  voltage-scaled  static  CMOS  at  the  expense  of 
2-3  times  longer  latency.  In  a  fully-integrated  test  chip  implemented  using  a  130nm 
bulk  silicon  process  and  on-chip  inductors,  chains  of  Boost  Logic  gates  operate  at 
clock  frequencies  up  to  1.3GHz  with  a  1.5V  supply.  When  resonating  at  850MHz 
with  a  1.2V  supply,  the  Boost  Logic  test  chip  achieves  60%  charge  recovery.  This 
Boost  Logic  is  the  fastest  charge-recovery  design  reported  to  date. 

• Charge-recovery ASIC design methodology: This ASIC methodology relies on a
novel charge-recovery flip-flop design and a metal-only clock distribution network.
By enabling the recovery of charge from the clock distribution network, this
methodology yields ASIC designs with minimal clock power dissipation. A resonant-clocked
ASIC for the Discrete Wavelet Transform has been designed using this methodology
and industry-standard tools. On-chip circuitry is used to generate a single-phase
resonant clock of sinusoidal shape. Correct operation has been confirmed experimentally
for clock frequencies up to 300MHz, with measured clock power savings ranging
between 60% and 75% [...] 115MHz, depending on primary input activity.

• GHz-class resonant clocking: The potential of resonant clocking for energy-efficient
design at GHz-class operating frequencies has been evaluated through chip
measurements of a 1.1GHz resonant clock distribution network in a 130nm bulk silicon
process. Energy savings on the order of 45% have been demonstrated.

• Low-power charge-recovery static memory (SRAMs): The proposed SRAM
architecture relies on balanced loading to achieve high-efficiency charge recovery from
the bit lines. In Spice simulations of SRAM arrays in a 250nm bulk silicon process,
the proposed architecture dissipates 27% less power than its conventional counterpart.

Figure 1: Schematic of Boost Logic gate with pseudo-NMOS pulldown.

The remainder of this section provides more details about each of our main contributions.

2.1  Boost  Logic 

Charge-recovery architectures reduce energy dissipation by steering currents across devices
with low voltage differences while recycling the energy stored in their capacitors [1, 4].
The efficient operation of these designs is the result of an energy-speed trade-off. This
project has led to the discovery of Boost Logic, a high-speed charge-recovery circuit family
that is capable of operating at GHz-class frequencies with high efficiency by trading off
power dissipation for latency of operation [11, 12, 13, 14].

Boost Logic achieves significant energy savings over voltage-scaled static CMOS across
a range of frequencies much higher than previously demonstrated in the charge-recovery
literature. A unique feature of Boost Logic that enables energy-efficient and high-throughput
operation is an aggressively scaled, conventionally switching “Logic” stage that operates
in tandem with a charge-recovery “Boost” stage. Logic performs the logical evaluation
of a Boost Logic gate operating at an ultra-low DC supply voltage of approximately one
threshold voltage Vth. After Logic pre-resolves the differential outputs of a Boost Logic
gate to the level of about one threshold voltage, Boost amplifies the difference between
the output nodes to the full rail in an energy-efficient charge-recovery manner, providing a
large overdrive to fanout gates and thereby reducing delay in their Logic stages.

Figure 1 shows the structure of a Boost Logic gate. The Logic stage can be implemented
in any transistor topology as long as it supports the use of clocked transistors MS-MS.
These clocked transistors decouple the Logic stage from the output nodes when the
Boost stage drives them. The pseudo-NMOS implementation shown in Figure 1 trades
off the voltage difference in the pre-resolved output nodes (pseudo-NMOS gates do not
swing to the full rail) for lower gate loading to achieve better performance at higher
operating frequencies. At lower operating frequencies, the use of dual-rail CMOS topology in
Logic offers the advantages of full-rail evaluation, the lack of crowbar current, and reduced
susceptibility to process variation. The DC power-supply rails are at voltages

    V'_{dd} = \frac{1}{2}(V_{dd} + V_{th})

    V'_{ss} = \frac{1}{2}(V_{dd} - V_{th})

Therefore, the potential difference between the supply rails in Logic is Vth. The Boost
stage resembles back-to-back inverters, with the only difference being that Vdd and Gnd
are replaced by the two complementary power-clock phases.

Figure 2: Simulation waveform of Boost Logic inverter.

Figure 2 shows the outputs of an inverter in a ring configuration of four Boost Logic
inverters. During Logic, an initial potential difference is developed between the two
complementary outputs. During Boost, Logic is deactivated, and the power-clock waveforms
drive the outputs to the rails (Vdd or Gnd). These outputs in turn drive fanout Logic stages.
As the power-clock phases swing back and their voltage difference approaches Vth, the
transistors in the Boost stage are in cutoff, isolating Boost from the outputs. At that time,
Logic once again begins to evaluate.

To compare the performance of Boost Logic designs with their conventional counterparts,
we designed 16-bit carry-save multipliers in static CMOS and Boost Logic using a
130nm bulk silicon process. In the Boost multiplier, a two-phase power-clock waveform
was obtained using an H-Bridge clock generator. The clock generator was driven at the
target clock frequency of 1GHz. The inductor value was set at L = 2.55nH, so that the
natural frequency of the LC system formed by the parasitic capacitance of the circuit and
the inductor of the clock generator matched the target operating frequency of 1GHz. The
static CMOS multiplier was pipelined and voltage-scaled for minimum power, achieving
1GHz operation with a 1V supply and 8 cycles of latency. With regular Vth, the latency of
the Boost multiplier was 3 times longer (24 cycles), but its energy efficiency was 5 times
higher (15.8pJ vs. 80.1pJ per cycle). With low Vth, the latency of the Boost multiplier was
2 times longer (16 cycles), while its energy efficiency was almost 10 times higher (10.64pJ
vs. 80.1pJ per cycle).
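The inductor sizing described above can be illustrated with the ideal LC resonance relation f = 1/(2π·sqrt(L·C)). The following is a minimal sketch, assuming a lossless lumped LC tank; the function names are illustrative, and the capacitance it reports is only the effective value implied by the quoted L and f, not a figure given in the report.

```python
import math

def resonant_capacitance(inductance_h, freq_hz):
    """Capacitance (F) that resonates with inductance_h at freq_hz (ideal LC tank)."""
    omega = 2.0 * math.pi * freq_hz
    return 1.0 / (omega ** 2 * inductance_h)

def natural_frequency(inductance_h, capacitance_f):
    """Natural frequency (Hz) of an ideal LC tank."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))

L_MULT = 2.55e-9    # inductor value quoted in the text (H)
F_TARGET = 1.0e9    # target operating frequency quoted in the text (Hz)

# Effective capacitance implied by the two quoted values (an inference, not a reported number).
c_implied = resonant_capacitance(L_MULT, F_TARGET)
print(f"implied effective capacitance: {c_implied * 1e12:.2f} pF")
print(f"round-trip frequency: {natural_frequency(L_MULT, c_implied) / 1e9:.3f} GHz")
```

How this effective value maps onto the per-phase parasitic capacitance depends on how the H-Bridge connects the inductor across the two phases, which the sketch does not model.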

In a fully integrated test chip implemented using a 130nm bulk silicon process and
on-chip inductors, chains of Boost Logic gates operate at clock frequencies up to 1.3GHz
with a 1.5V supply. Figure 3 shows a block diagram and Figure 4 shows a microphotograph
of our test chip. The test structures on the chip were 8 chains of AND, OR, XOR, and INV
gates. Each chain had 200 gates. An on-chip H-Bridge clock generator was used with a
2.4nH on-chip inductor. The four clock generator switches were driven at the target clock
frequency.

Figure 3: Block diagram of Boost Logic test chip.

Figure 4: Microphotograph of Boost Logic test chip.

Figure 5 gives measured current and inferred power dissipation in the supply of
the test chip over a range of operating frequencies from 700MHz to 1.1GHz. Correct
operation was verified up to 1.3GHz. The natural frequency of the chip was measured
at approximately 850MHz. At that frequency, energy per cycle was measured at 26pJ,
yielding a 60% reduction in power dissipation over CV^2 switching of the same capacitive
load.
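The measured figures above can be related with simple arithmetic. The sketch below is illustrative only: it takes the 26pJ, 850MHz, and 60% values from the text and infers the conventional CV^2 energy and the corresponding power levels, which the report does not state explicitly.

```python
E_MEASURED = 26e-12   # measured energy per cycle at resonance (J), from the text
F_RESONANT = 850e6    # measured natural frequency (Hz), from the text
REDUCTION = 0.60      # reported reduction relative to CV^2 switching, from the text

# At a fixed frequency, a 60% power reduction is also a 60% energy-per-cycle reduction.
e_conventional = E_MEASURED / (1.0 - REDUCTION)   # inferred CV^2 energy per cycle
p_measured = E_MEASURED * F_RESONANT              # inferred measured power
p_conventional = e_conventional * F_RESONANT      # inferred conventional power

print(f"inferred CV^2 energy per cycle: {e_conventional * 1e12:.1f} pJ")
print(f"inferred measured power:        {p_measured * 1e3:.2f} mW")
print(f"inferred conventional power:    {p_conventional * 1e3:.2f} mW")
```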

Figure 5: Measured current and corresponding per-cycle energy vs. frequency.

2.2 Charge-recovery ASIC design

Another significant contribution of this project has been the development of an ASIC
methodology for charge-recovery design. This methodology relies on a flip-flop
architecture that can operate with a sinusoidal clock to yield a resonant-clocked ASIC with a
metal-only clock distribution network.


Figure  6:  Microphotograph  of  the  resonant  clocked  ASIC  chip. 

To demonstrate the effectiveness of our charge-recovery ASIC design methodology,
we have designed and tested a synthesized ASIC that performs a 7-bit Discrete Wavelet
Transform. The chip has been fabricated in a 250nm bulk-CMOS process through MOSIS.
Comprising close to 4,000 gates, our ASIC is clocked by a resonant charge-recovering
waveform of sinusoidal shape. Figure 6 shows a microphotograph of our resonant-clocked
chip [17]. The lower left corner of the die contains our experimental energy-recovering
design that consists of an ASIC core, an on-chip resonant clock generator, and some testing
logic. The energy-recovering flip-flops are driven by a resonant waveform generated using
an off-chip surface-mount inductor and the on-chip power-clock generator.

A schematic of the energy-recovering flip-flop used in our ASIC is shown in Figure 7.
This flip-flop consists of a charge-recovery dynamic buffer that drives a pair of
cross-coupled NOR gates as the static latch element. Our flip-flop latches on rising pulses of
power-clock. The input needs to be stable by the time power-clock is roughly halfway to
its peak, and should be held stable until power-clock is at the peak. The flip-flop draws
more current from the power-clock when active (i.e., the data is changing), thus changing
the effective load seen by the power-clock generator.

Figure 7: Energy-recovering sinusoidal flip-flop used in the resonant-clocked ASIC chip.


Figure 8: Clock generator used in ASIC chip.

Our chip includes a resonant power-clock generator with single-cycle feedback control, shown
in Figure 8, that is capable of reacting to changes in its load. The amplitude of the
power-clock signal is sampled and compared against a reference level. The result of this
comparison is used to decide, on a cycle-by-cycle basis, whether or not to turn on the main
NMOS power-switch to pump more energy into the power-clock. This control is critical
for achieving ultra-low dissipation when the ASIC is idling.

Figure 9 shows the measured energy dissipation of the clock network in our
resonant-clocked ASIC chip at several frequencies between 100MHz and 300MHz. At each
frequency point, the voltage was scaled down to the minimum required for correct operation.
The inductor and DC supplies were connected externally. For reference, we plot a quadratic
curve fit to the function f C Vdd^2 evaluated at each of the (voltage, frequency) pairs. This curve
represents the dissipation required to drive the same clock capacitance if charge-recovery
techniques were not used. At f = 300MHz, the clock was overdriven using an inductance
value larger than 1/(C(2πf)^2), resulting in suboptimal power dissipation at that frequency.
At 205MHz, the measured clock power dissipation was 4.5mW, about 5 times less than
required to drive the same clock capacitance with conventional means. These dramatic
power savings are due to operation near the resonance of the inductor in conjunction with
the clock capacitance.

Figure 9: Measured power consumption of resonant-clocked ASIC.
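The two reference quantities in this paragraph, the resonant inductance 1/(C(2πf)^2) and the conventional clock power f C Vdd^2, can be sketched as follows. This is a minimal illustration under stated assumptions: the clock capacitance `C_EXAMPLE` is a hypothetical placeholder (the report does not give the value), and the conventional-power estimate at 205MHz is inferred from the quoted "about 5 times" ratio rather than measured.

```python
import math

def resonant_inductance(capacitance_f, freq_hz):
    """Inductance (H) that resonates with capacitance_f at freq_hz: 1 / (C (2 pi f)^2)."""
    return 1.0 / (capacitance_f * (2.0 * math.pi * freq_hz) ** 2)

def natural_frequency(inductance_h, capacitance_f):
    """Natural frequency (Hz) of an ideal LC tank."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))

def conventional_clock_power(freq_hz, capacitance_f, vdd_v):
    """Power (W) to drive capacitance_f rail-to-rail at freq_hz without charge recovery."""
    return freq_hz * capacitance_f * vdd_v ** 2

C_EXAMPLE = 50e-12   # HYPOTHETICAL clock capacitance (F); not a value from the report
F_POINT = 205e6      # frequency point quoted in the text (Hz)

l_res = resonant_inductance(C_EXAMPLE, F_POINT)
print(f"resonant inductance for the assumed C: {l_res * 1e9:.2f} nH")

# Using the quoted measurement: 4.5 mW measured, about 5x below the conventional level.
P_MEASURED = 4.5e-3
p_conventional_estimate = 5.0 * P_MEASURED
print(f"implied conventional clock power: about {p_conventional_estimate * 1e3:.1f} mW")
```

An inductance larger than `resonant_inductance(C, f)` places the natural frequency below the driving frequency, which corresponds to the off-resonance condition described for the 300MHz point.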


Figure  10:  Measured  power-clock  spectrum  at  200MHz. 

In addition to reduced power dissipation, charge-recovery circuitry has the potential to
operate with substantially reduced electromagnetic interference. To provide empirical
evidence in support of this largely unexplored property, we analyzed the spectrum of the measured
power-clock waveform when resonating at 200MHz. The spectrum obtained is shown in
Figure 10, zoomed in on the region of interest from 0 to 1GHz. This data was obtained
by recording 100,000 voltage samples at 100ps/sample at the off-chip inductor terminal.
Assuming linear characteristics from the parasitic elements between the inductor terminal
and the on-chip clock network, this data should be proportional to the actual clock signal
on-chip. The graph shows the presence of substantially attenuated odd and even
harmonics. Specifically, the first 3 harmonics are 22dB, 36dB, and 43dB below the fundamental,
respectively. In contrast, the first harmonic of a square waveform at 600MHz is about 12dB
below the fundamental. The spur at roughly 10MHz could be attributed to a periodicity in
the datapath self-test activity, as it corresponds roughly to the spectrum of one of the
self-test signature outputs. An alternate hypothesis is that it results from some coupling with
one of the I/O pads slewing.

Figure 11: Microphotograph of GHz resonant clock distribution test chip.
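The acquisition parameters quoted above determine the span and resolution of the spectrum. The sketch below derives them and converts the quoted harmonic levels to linear ratios. It assumes the dB figures refer to an amplitude (voltage) spectrum, so that the 20·log10 convention applies; the report does not state the convention, and the helper name is illustrative.

```python
N_SAMPLES = 100_000        # number of voltage samples, from the text
SAMPLE_PERIOD = 100e-12    # seconds per sample, from the text
F_FUNDAMENTAL = 200e6      # power-clock frequency during the measurement, from the text

sample_rate = 1.0 / SAMPLE_PERIOD            # samples per second
record_length = N_SAMPLES * SAMPLE_PERIOD    # total capture time (s)
nyquist = sample_rate / 2.0                  # highest resolvable frequency (Hz)
bin_width = sample_rate / N_SAMPLES          # DFT frequency resolution (Hz)
fundamental_bin = round(F_FUNDAMENTAL / bin_width)

def db_below_to_amplitude_ratio(db_below):
    """Linear amplitude ratio for a component db_below dB under the fundamental.

    ASSUMPTION: dB values describe voltage amplitude (20*log10 convention).
    """
    return 10.0 ** (-db_below / 20.0)

print(f"record length: {record_length * 1e6:.1f} us, Nyquist: {nyquist / 1e9:.1f} GHz")
print(f"bin width: {bin_width / 1e3:.0f} kHz, fundamental in bin {fundamental_bin}")
for level_db in (22.0, 36.0, 43.0):
    ratio = db_below_to_amplitude_ratio(level_db)
    print(f"{level_db:.0f} dB below fundamental -> amplitude ratio {ratio:.4f}")
```

With these parameters the 0 to 1GHz region shown in Figure 10 is well inside the Nyquist limit, and a 10MHz spur sits many bins away from DC and from the fundamental.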

2.3  GHz-class  resonant  clocking 

Resonant clock distribution has the potential to reduce clock power and achieve low clock
skew and jitter [3]. In this project, a two-phase resonant clock network with a programmable
driver and loading has been designed and evaluated. This network uses a clock generator
that is driven at a reference clock frequency. It also uses the size and duty cycle of the
replenishing switches in the clock generator to adjust clock amplitude. Programmable
loading allows for different balanced/imbalanced load configurations, enabling the
investigation of clock amplitude, power, and skew at resonance and off resonance for operating
frequencies in the 900MHz to 1.2GHz range. Included is on-chip circuitry for measuring
skew and clock amplitude.

Figure 11 gives a microphotograph of our resonant-clocked test chip. Spiral inductors
of 4nH, connected in parallel, are placed symmetrically around the center of the H-tree clock
network. A single central H-bridge clock generator is used to compensate for power losses
and maintain clock amplitude using switches of programmable size that are driven by
replenishing pulses of programmable duty cycle.

Figure 12: Measured total power dissipation and efficiency vs. frequency.

Figure 13: Measured power dissipation and clock amplitude vs. frequency.

Figure 12 gives measured total power dissipation as a function of operating frequency.
All data are obtained with a total capacitance of approximately 52pF per phase, yielding a
resonant frequency of 990MHz. For each data point, switch size and duty cycle have been
chosen to yield minimum power dissipation at the corresponding operating frequency while
maintaining the same average amplitude over all 16 leaf nodes in the network. The curves
report average results for 4 test chips. The maximum difference in measured amplitude and
power among the 4 chips is less than 6%. Power dissipation is minimized when the system
is driven at its resonant frequency. Maximum relative power savings of the resonant clock
system over conventional CV^2 are in the 45% range.
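The quoted 52pF per phase and 990MHz resonant frequency imply an effective inductance through the ideal LC relation. The sketch below evaluates it under two loading assumptions, because the text does not state how the two phase capacitances combine across the inductors, nor how many 4nH inductors are placed in parallel. Both cases, the inductor count in the final comment, and the function name are assumptions for illustration.

```python
import math

def required_inductance(capacitance_f, freq_hz):
    """Inductance (H) that resonates with capacitance_f at freq_hz (ideal LC tank)."""
    return 1.0 / (capacitance_f * (2.0 * math.pi * freq_hz) ** 2)

C_PHASE = 52e-12   # capacitance per phase, from the text (F)
F_RES = 990e6      # resonant frequency, from the text (Hz)

# ASSUMPTION A: the inductance resonates with a single phase capacitance.
l_single_phase = required_inductance(C_PHASE, F_RES)

# ASSUMPTION B: the inductance sits between the two phases and sees their
# capacitances in series, i.e. C_PHASE / 2.
l_between_phases = required_inductance(C_PHASE / 2.0, F_RES)

print(f"assumption A: {l_single_phase * 1e9:.3f} nH")
print(f"assumption B: {l_between_phases * 1e9:.3f} nH")

# For comparison only: n identical 4 nH inductors in parallel give 4/n nH.
# The count n is NOT stated in the text; n = 4 would give 1 nH.
```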

Figure 13 gives measured clock amplitude and power dissipation as functions of
frequency for four switch size/duty cycle configurations (average from 4 test chips). The
results show that configurations with larger switch size and smaller replenishing duty
cycle can dissipate less power, while maintaining the same amplitude. When the driving
frequency is 10% off resonance at 1.1GHz, power dissipation increases by 3[...] (720µm, 30[...] than
the (580µm, 44[...] (630µm, 44[...] results. In general, larger switches reduce resistance in the clock
generator and increase clock amplitude. Smaller duty cycle reduces current from Vdd to
Vss, hence lowering power dissipation.


Figure 14: CLM column with dummy bit-line and charge recovery. (Signal labels in the
figure: dummy bit line BLD; real bit line pair BLT, BLF; output dout.)

2.4  Charge-recovery  SRAM 

The  application  of  energy  recovery  to  memory  design  is  particularly  compelling,  due  to 
the  substantial  switching  capacitance  in  memory  arrays.  Early  work  on  energy  recovery 
memories  has  used  multiple  power-clocks  [15,  9,  2,  8,  10,  16],  resulting  in  designs  with 
multiple-cycle  latency  and  limited  scalability.  Low-complexity  energy  recovery  memories 
with  a  single-phase  power-clock  have  been  reported  in  [7,  6].  These  memories  exhibit 
data/operation-dependent  variations  in  the  capacitive  load  presented  to  the  power-clock, 
however,  resulting  in  limited  energy  efficiency  due  to  poor  resonance. 

In this project, we have explored CLM, an energy recovery SRAM architecture that
presents a constant capacitive load to the power-clock, regardless of memory operations or
data access patterns [5]. In CLM, non-selective precharge is used to ensure a constant
memory load during write operations, regardless of data pattern. Furthermore, when bit lines
are disconnected from the power-clock during non-write cycles, dummy bit lines of equal
capacitance are connected to the power-clock, maintaining a constant memory load. CLM
provides single-cycle operations using a single-phase power-clock for low complexity and
efficient high-speed operation.

A schematic diagram of a bit-line in the proposed constant-load charge-recovery
memory is shown in Figure 14. In this design, each bit-line pair BLT and BLF is “shadowed” by
a dummy bit-line. Each bit-line is selectively connected to the column memory cells and a
sense amplifier. The dummy bit-line has no connection with the column cells or the sense
amplifier, however. During each write cycle, exactly one of the drivers turns on, transferring
charge between the system inductor and exactly one of the bit-lines, BLT or BLF.
During each non-write cycle, the driver DD turns on, connecting the dummy bit-line
with the power-clock. The dummy bit-line is designed so that it presents approximately
the same load on the power-clock as an actual bit-line. Thus, the load of the power-clock
remains constant during write and non-write cycles, maintaining a constant amplitude and
maximizing energy efficiency.
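The constant-load behavior described above can be sketched as a small behavioral model of one column. This is an illustration only, not the circuit itself: capacitances are normalized, the mapping of data value to BLT or BLF is a hypothetical choice, and the comparison case without a dummy bit-line is included only to show the load variation that CLM avoids.

```python
C_BITLINE = 1.0   # normalized capacitance of a real bit-line (illustrative)
C_DUMMY = 1.0     # dummy bit-line is designed to match a real bit-line

def connected_line(is_write, data_bit):
    """Name of the line tied to the power-clock in this cycle.

    ASSUMPTION: data_bit 1 drives BLT and 0 drives BLF; the text only says
    that exactly one of the two is driven during a write.
    """
    if is_write:
        return "BLT" if data_bit else "BLF"
    return "BLD"  # dummy bit-line, connected through driver DD

def clm_load(is_write, data_bit):
    """Load seen by the power-clock with a dummy bit-line (CLM)."""
    return C_DUMMY if connected_line(is_write, data_bit) == "BLD" else C_BITLINE

def load_without_dummy(is_write, data_bit):
    """Load seen by the power-clock if nothing is connected on non-write cycles."""
    return C_BITLINE if is_write else 0.0

# (is_write, data_bit) for a short mixed sequence of cycles
cycles = [(True, 1), (False, 0), (True, 0), (False, 1), (False, 0), (True, 1)]

clm_loads = [clm_load(w, d) for w, d in cycles]
plain_loads = [load_without_dummy(w, d) for w, d in cycles]

print("lines driven:      ", [connected_line(w, d) for w, d in cycles])
print("CLM load per cycle:", clm_loads)
print("no-dummy load:     ", plain_loads)
```

In the model, the CLM load is identical in every cycle, while the no-dummy load alternates with the operation mix, which is the data/operation-dependent variation the architecture is meant to remove.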
To assess the performance of our proposed energy recovery SRAM architecture, we
have designed a 128x256 SRAM using a 250nm bulk silicon process. In Spice
simulations with a 2.5V supply, CLM functions correctly at clock frequencies up to 400MHz.
Using an ideal power-clock waveform, CLM achieves power reductions in excess of 37%
over its conventional counterpart with a 42/58 write/non-write operation mix. Assuming
lossless power-clock generation, the proposed SRAM dissipates 38% less power than its
conventional counterpart at 400MHz, 2.5V. When the power dissipation of the power-clock
generator is taken into account, overall power savings are 27%.